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Abstract 



ji ' This paper proposes a new method for estimating sparse precision matrices in the 

c/3 ' high dimensional setting. This procedure apphes a novel Sparse Column-wise Inverse 

Operator (SCIO) to modified sample covariance matrices. We establish the conver- 

Kj^ gence rates of this procedure under various matrix norms. Under the Frobenius norm 

^«0 ' loss, we prove theoretical guarantees on using cross validation to pick data-driven 

QQ ' tunning parameters. Another important advantage of this estimator is its efficient 



computation for large-scale problems, using a path- following coordinate descent al- 
gorithm we provide. Numerical merits of our estimator are also illustrated using 
CN I simulated and real datasets. In particular, this method is found to perform favorably 

on analyzing an HIV brain tissue dataset and an ADHD resting fMRI dataset. 
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1 Introduction 

Estimating covariance matrix and its inverse is fundamental in multivariate analysis. Among 
many interesting examples are principal component analysis, linear/quadratic discriminant 
analysis, and graphical models. In particular, the inverse covariance matrix (precision ma- 
trix) plays important roles in the latter two examples, and we will focus on estimating the 
precision matrix in this paper. Driven by recent advances on data collecting technologies, 
one often need to draw statistical inference on datasets with very large number of variables, 
much larger than the sample size. Under this setting, also known as high dimensional set- 
ting, it is no longer viable to invert the sample covariance to estimate the precision matrix. 
Computationally, even if such operation could be carried out, inverting a very large ma- 
trix is expensive in memory and time costs. To address these challenges in computation 
and estimation, we propose a new column-wise procedure that enjoys efficient computation 
while maintaining desirable convergence rates. 

Let X = {Xi, . . . ,Xp) be a p-variate random vector with a covariance matrix I] and 
its corresponding precision matrix ft := S^^. Suppose we observe an independent and 
identically distributed random sample {Xi, . . . ,X„} from the distribution of X. Various 
regularizations on the likelihood criterion have been proposed to stabilize the estimate 
for n. In particular, the ii penalized normal likelihood estimator and its variants, which 
shall be called £i-MLE estimators, were considered in several papers; see, for example. 
Yuan and Lin (2007), Friedman et al. (2008), Banerjee et al. (2008), and Rothman et al. 
(2008). Friedman et al. (2008) developed an efficient R package, Glasso, to compute the 
£i-MLE. The convergence rate under the Frobenius norm loss was given in Rothman et 
al. (2008). Under the mutual incoherence or irrepresentable conditions, Ravikumar et al. 
(2011) obtained the rates of convergence in the elementwise £oo norm and spectral norm. 
Nonconvex penalties, usually computationally more demanding, have also been considered 
under the same normal likelihood model. For example. Lam and Fan (2009) and Fan et 
al. (2009) considered penalizing the normal likelihood with the nonconvex SCAD penalty 
(Fan and Li, 2001). The main goal is to ameliorate the bias problem due to £i penalization. 
One bottle neck in computing these estimators is its complex likelihood function. 

Recently, column-wise or neighborhood based procedures has caught much attention 
because of the advantages in both computation and convergence rates. In an important 
paper, Meinshausen and Biihlmann (2006) demonstrated convincingly a neighborhood se- 
lection approach to recover the support of $7 in a row by row fashion. For each row, the 
computation is reduced to run a £i penalized least squares, aka LASSO (Tibshirani, 1996). 
This then can be solved efficiently via path-following coordinate descent (Friedman et al. 



2008b). Yuan (2009) replaced the lasso selection by a Dantzig type modification, where first 
the ratios between the off-diagonal elements Uij and the corresponding diagonal element 
Uii were estimated for each row i and then the diagonal entries uu were obtained given the 
estimated ratios. Convergence rates under the matrix ii norm and spectral norm losses 
were established. This procedure can be solved via standard packages on linear program- 
ming. Cai, Liu and Luo (2011) proposed a procedure, CLIME, which seeks the sparsest 
precision matrix (measured by the £i norm) within a modified feasible set of the ii-MLE 
estimator. Their procedure is casted as a column-wise procedure, and each column is es- 
timated via linear programming. They established the convergence rates of various norms 
without imposing the mutual incoherence conditions (Ravikumar et al. 2011), and proved 
improved convergence rates upon the £i-MLE estimator when X follows polynomial tail 
distributions. Even though Yuan (2009) and CLIME can be casted as linear programming, 
these problems are still computational expensive for really large p. 

All these penalization methods require choosing some appropriate tuning parameters, 
also known as penalization parameters. Despite that these procedures are justified using 
asymptotic and finite-sample theories before, understanding of these procedures in prac- 
tice is rather limited, as the theories are usually built on some theoretical choices of tuning 
parameters that cannot be implemented in practice. On the other hand, cross validation 
is probably the most widely employed data-driven scheme for choosing such parameters, 
however, the corresponding theory is sparse. Bickel and Levina (2008) analyzed the per- 
formance of thresholding covariance matrices, where the threshold is chosen using partial 
samples. A different approach using large sample theory was employed by Cai and Liu 
(2011), and they provided adaptive thresholding for covariance matrix estimation using 
the whole samples. Unfortunately, these results cannot be simply extended to the inverse 
covariance setting, due to the problem complexity. Exploiting the simplification brought 
by our column-wise procedures, this paper is among the first to demonstrate that cross 
validation is theoretically justified in choosing the tunning parameters for estimating the 
precision matrix. 

In the present paper, we develop a simple column-wise procedure, called Sparse Colum- 
nwise Inverse Operator (SCIO), to study estimation of the precision matrix f2. This pro- 
cedure works for both sparse and non-sparse matrices without restricting to a specific 
sparsity pattern. We establish theoretical guarantees for the SCIO estimator. Rates of 
convergence in spectral norm as well as elementwise £oo norm and Frobenius norm are es- 
tablished. A matrix is called s-sparse if there are at most s non-zero elements on each row. 
It is shown that when fi is s-sparse and X has either exponential-type or polynomial-type 



tails, the error between our estimator Q and Q satisfies ||r2 — 0||2 = Op(sy^\ogp/n) and 
|r2 — r2|oo = Op ( A/log p/n), where || ■ II2 and | ■ |oo are the spectral norm and elementwise 
loo norm respectively. The SCIO method can also be adopted for the selection of graphical 
models (Lauritzen, 1996), where the elementwise i^o norm result is instrumental. 

A significant advantage of the SCIO estimator is its computational efficiency for large- 
scale problems, thanks to its column-by- column computation. From a pure computational 
point of view, column-by- column procedures are examples of the general divide-and-conquer 
principal for large-scale computation. The estimator can be obtained one column at a 
time by solving a simple objective function for each column, and the resulting matrix 
estimator is formed by combining the vector solutions into a matrix. The final step is 
to symmetrize the matrix using a simple operation, which we used in Cai, Liu and Luo 
(2011). An improvement of computation comes from the key observation that the simple 
objective function for each column can be efficiently solved using the iterative coordinate 
descent algorithm, where each update is expressed in closed form. Indeed, this column- 
by-column computation principal has been employed for solving the £i-MLE in its efficient 
R implementation Glasso by Friedman et al. (2008). However, they have two layers of 
iterations: one outer layer of iterations across the columns and an inner one to solve 
a LASSO problem iteratively using coordinate descent. The SCIO estimator no longer 
needs the outer iterations, and thus we observe improved computational speed in all of our 
examples. An R package of our method has been developed and is publicly available on 
CRAN. 

The rest of the paper is organized as follows. In Section |2l after basic notations and 
definitions are introduced, we present the SCIO estimator. Theoretical properties including 
the rates of convergence are established in Section [3l A data-driven choice of the tuning 
parameter is discussed in Section |U where we prove theoretical guarantees of using cross 
validation. The coordinate descent algorithm for solving SCIO is introduced in Section |5l 
and we also demonstrate its numerical performance through simulation studies and real 
data analyses. Further discussions on the connections and differences of our results with 
other related work are given in Section [6l The proofs of the main results are given in 
Section [71 

2 Methodology 

In this section, we motivate the SCIO estimator. At the population level, given the pop- 
ulation covariance matrix S, we define the column loss functions for every i = 1,2, ... ,p, 



which take the form 

/,(S,S) = i/3fSA-efA (1) 

where B = (/3i, /32, • • • , /3p) . Each function fi in ([T]) is strictly convex in (3^ as S is strictly 
positive-definite; more importantly, the minimal values of each fi are achieved at /3j's that 
satisfy the following equality for each i 

T.(3, - e, = 0. (2) 

It is straightforward to see that the columns of the precision matrix fi satisfy these equal- 
ities, and thus minimize all the functions in ([1]). In fact, this is also the unique solution of 
(l2l) if S is full rank, given by the inversion formula cjj = S~ e^ = ilej. 

Certainly, because Xl is usually unknown, the functions in the form ([T]) and the inversion 
formula cannot be directly applied to produce proper estimators of fi. However, we can 
replace with the sample covariance matrix S to produce the corresponding sample versions 
of©: 

One intuitive idea is to minimize the above function to produce proper estimators for 17. 
But this is not efficient because it does not utilize the assumption that the underlying Q. 
is sparse, and more importantly there might be multiple solutions when S is not full rank. 
This happens in high dimensional problems where p is much larger than n. 

Motivated by recent developments on using the li norm to estimate the precision matrix 
(Friedman, Hastie, and Tibshirani, 2008; Cai, Liu and Luo, 2011), we use the ii penalty 
to enforce the sparsity of each column-wise solution via minimizing the following objective 
function 

^/3^S/3-ef/3 + A™|/3|i (3) 

for each i = 1,2, ...,p, where the penalization parameter A„j > can be different for 
different columns. By taking the subgradient of ([3]), the minimal values satisfy the following 
constraint ior i = 1,2, . . . ,p, 



S/3 



< A. 



This is exactly the constraint used for the CLIME estimator by Cai, Liu and Luo (2011). 
We now proceed to formally define the SCIO estimator. Let (3^ be the solution to the 
following equation: 

^, = argmin {i/3^S/3 - ej(3 + A™|/3|i|, (4) 



I3&1RP 



where f3 = {f3i, . . . , f3p)'^ . The fully data-driven choice of A™ is introduced in Section 
4. Let /3j = {I3ii, . . . , Pip)'^. Similar to the CLIME estimator, the solution of (jl]) is not 
necessarily symmetric. To obtain the SCIO estimator Cl = {Cdij)pxpi "we will employ the 
same symmetrization step as in CLIME, 



^ij — (^ji 



4/{l4l<|/3,.|} + /3,./{l4l>|/3,.|}. (5) 



The choice of A„j, as will be given in Section 4, is adaptive to the columns of precision 
matrix. In real applications, the sparsity in each column may be different dramatically. The 
adaptive choice of the tuning parameter is chosen using our column-by-column loss. The 
Glasso estimator by Friedman, Hastie, and Tibshirani (2008), on the other hand, does not 
provide an inexpensive implementation like ours because they aim to compute the whole 
matrix using a likelihood loss of all entries, which consists of determinant computation for 
example. 

3 Theoretical guarantees 

In this section, we state the convergence rates of Cl. The result on support recovery 
is also given. We begin with basic notations and definitions. Throughout, for a vector 
a = (ai, . . . , Op)-'" G IR^, define |a|i = ^^=i|aj| and |a|2 = \Y7j=i'^']- -^^^ ^ matrix 
A = {ttij) G IRP^'^, we define the elementwise /qo norm l^loo = maxi<j<pi<j<g |ajj|, the 
spectral norm \\A\\2 = sup|x|2<i |^x|2, the matrix ii norm ||A||li = ^'^^^i<j<qYl^=i kijl' 



the matrix oo norm ||A||oo = uiaxi<i<g Yl^=i kijl; ^^e Frobenius norm \\A\\f = \/J2i j ^fji 
and the elementwise ^i norm ||A||i = X]f=i Sf=i \^i,j\- ^ denotes a p x p identity matrix. 
For any two index sets T and T and matrix A, we use Aj,rpt to denote the \T\ x \T \ matrix 
with rows and columns of A indexed by T and T respectively. The notation A y means 
that A is positive definite. For two real sequences {a„} and {bn}, write a„ = 0{bn) if there 
exists a constant C such that |a„| < C\bn\ holds for large n, an = o(6„) if lim„_yoo dn/bn = 0. 

3.1 Convergence rates of r2 — Jl 

We first introduce some conditions. The first condition is on the sparsity of 17. Let Si be 
the support of cj.^j, the i-th column in fl. Define the Sp-sparse matrices class 

p 
U=\flyO: max V /{u;„- 7^ 0} < Sp, \\n\\L, < Mp, 



i<j<P- 

1=] 



Cq^ < Amin(f7) < Amax(r2) < Coj, 



where cq is a positive constant. 
(CI). Suppose that ft eU with 



'^ = "(^/l^) (^^ 



and 



max llS^c^s^S^ 5^ lU < 1 - a (7) 



for some < a < 1. 

As we will see from Theorem [H condition (|6]) is required for the consistency of the 
estimator. Condition ([7]) is a mutual incoherence or irrepresentable condition. Such a 
condition is almost necessary for support recovery through the penalization method. A 
similar irrepresentable condition was imposed by Ravikumar et al. (2011) for analyzing 
Glasso. We will compare ([7]) to their irrepresentable condition in Remark 3. 

Let Y = (Yi, . . . , Yp)'^ = nX — flfi. The covariance matrix of Y is thus f2. The second 
condition is on the moment of X and Y. 

(C2). (Exponential-type tails) Suppose that logp = o{n). There exist positive numbers 
7] > and K > such that 

Eexp fri{Xi - fii)A < K, Eexp (r]YA <K for all 1 < ? < p. 

(C2*). (Polynomial-type tails) Suppose that for some 7,Ci > 0, p < cin'^, and for some 
5>0 

E|Xi - /i»|^^+^+^ < K, E|F,|^T+^+^ < K for all i. 

We will assume either one of these two types of tails in our analysis. These two conditions 
are standard for analyzing precision matrix estimation, see Cai, Liu and Luo (2011) and 
references within. 

The first result is on the convergence rate under the spectral norm. It implies the conver- 
gence rates of the estimation of eigenvalue and eigenvector, which is essential in principle 
component analysis. The convergence rate under spectral norm is also required in the 
classification problem, wherein the estimation of the precision matrix plays an important 
role. 

Theorem 1 Let Xni = Cq i/log p/n with Cq being a sufficiently large number. Under (CI) 
and (C2) (or (C2*)), we have 



\n - ny < CiMpSp^ ' °^^ 



n 



with probability greater than 1 — 0{p ^ + n ^^^), where Ci > depends only on co,ri,Co 
and K . 



Remark 1. If MpSp\ -^^^ = o(l), then Q is positive definite with probabihty tending 



to one. We can also revise f2 to dp with 

f2, = f2 + pi, (8) 

where p = (|Amin(f^)| + ^~^^^)-?^{Amin(f^) < 0}. Then ft is always positive definite. By 
Theorem [T|, we have p < CMpSp\/ -^^ with probability greater than 1 — 0{p~^ + n~ 
and hence 



\np-n\\2<CMpSp^' °^^ 



n 

Such a simple perturbation will make the estimator be positive definite. The later results 
concerning support recovery and the convergence rates under other norms also hold under 
such a perturbation. To improve numerical stability, this perturbation strategy (JHl) can 
also be applied to the sample covariance as long as p = O in~^^'^ \og^'^ p), and all the 
theoretical results also hold under such a perturbation, see also Cai, Liu and Luo (2011). 
Remark 2. Ravikumar et al. (2011) imposed the following irrepresentable condition 
on Glasso estimation: for some < a < 1, 

max|E(<l>e$i)E($s$^)-'|i<l-«, (9) 

where S is the support of f2 and ^{j,k) = ^j^k — EXjX^. To make things concrete, we 
now compare our conditions using the examples given in Ravikumar et al. (2011): 

1. In the diamond graph, let p = 4, an = 1, (T23 = 0, a^ = 2p^ and ctjj = p for all 
^ 7^ h ihj) 7^ (2,3) and (2,4). For this matrix, (E]) is reduced to 4|p|(|p| + 1) < 1 
and so it requires p G (—0.208,0.208). We prove that our condition (JTj) only needs 
pe (-0.5,0.5). 

2. In the star graph, let p = 4, an = 1, aij = p for j = 2, 3, 4, aij = p^ for 1 < i < j < 4. 
For this model, (Il2]) requires |p|(|p| + 2) < 1 (p G (-0.4142,0.4142)), while our 
condition ([7]) holds for all p G (—1,1). 

We also have the following result on the convergence rates under the element-wise l^o norm 
and the Frobenius norm. 



Theorem 2 Under the conditions of Theorem Ui we have with probability greater than 



\n-n\^<CMp\l^^ (10) 



and 



-\\{i-n\\l<csj^^. (11) 

p n 

Remark 3. Note that the convergence rate under the Frobenius norm does not depend 
on Mp. On the other hand, Cai, Liu and Zhou (2011) obtained the minimax lower bound 
resuh when X ~ iV(/x, S) 

- min max Eliri- rill p > cM^^Sp-^. (12) 

The rate in ( ITT]) is faster than the rate in (1121) since we consider a smaller matrix class. In 
Ravikumar et al. (2011), they proved that the Glasso estimator ^ciasso has the following 
convergence rate 

1„A r^,|2 ^ /,.2_ logP 



I f^Giasso — ^11 2 = Op KpSp I, (13) 

p' \ n / 

where Kr = Hr^^lUi and F = (S ® S)s5. Our convergence rate is faster than their rate in 
(IT3D if Kr ^ oo. 



3.2 Support recovery 

As discussed in the introduction, the support recovery is related to the Gaussian graphical 
model selection. The support of ft is also recovered by SCIO. Let \i/ = {{i,j) : (jJij 7^ 0} be 
the support of ft. Let 

The next theorem gives the result on support recovery. 

Theorem 3 (i). Under the conditions of Theorem 1, we have \i/ C v[f with probability 
greater than 1 — 0{p~^ + n~^/^). (ii). Suppose that for a sufficiently large number C > 0, 



9„ := min \ujiA > CM^J^^. (14) 

Under the conditions of Theorem [21 we have ^ = ^ with probability greater than 1 — 
0(p-i + n-^/8). 
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4 Data-driven choice of Xni 

This section introduces the procedure on the choice of the tuning parameter Xni- We 
consider the following cross validation (CV) method for the analysis, similar to Bickel and 
Levina (2008). Divide the sample {X^; 1 < k < n} into two subsamples at random. Let 
rii and n2 = n — rii he the two sample sizes for the random split satisfying ni x ■^,2 x n, 
and let Sy, E2 be the two sample covariance matrices from the wth split, for v = 1, . . . ,H, 
where if is a fixed integer. Let /3j (A) be the estimator minimizing 

^(^) = ^ E [^(^I(^))^^2^:(A) - ef^:(A)] . (15) 

v=l 

For implementation purposes, we can divide an interval (0, a] by Ai < ■ ■ ■ < X^, where 
Aj = j^a. The final tuning parameter is chosen by 

Aj = argmin R{Xj). (16) 

{A,:l<i<7V} 

The choice of Aj could be different for estimating different columns of the precision matrix. 
It is thus adaptive to the sparsity of each column, comparing with the standard Glasso esti- 
mator. The theoretical property of Glasso is hard to analyze under CV. For the estimation 
of the covariance matrix, Bickel and Levina (2008) obtained the convergence rate under 
the Frobenius norm for the threshold estimator of covariance matrix, where the threshold 
is based on partial samples. However, it had been an open problem on the convergence 
rate for estimating the precision matrix when the tuning parameter is chosen by CV. Our 
Theorem |4] solves this problem by showing that the estimator based on the partial samples 
and Aj from ( fT5l) can attain the optimal rate under the Frobenius norm. For simplicity, we 
let H = 1 as in Bickel and Levina (2008). Let n[ := (ulj^) = (^^^(Ai), . . . ,^p(Ap)) be the 
corresponding column solutions when the tuning parameters are chosen using flT6|) for each 
column. The matrix Cl^ is symmetrized as before. 

The following theorem shows that the estimator Cl = {(^h) attains the optimal rate under 
the Frobenius norm. 

Theorem 4 Under the conditions of TheoremUl logiV = O(logp), y^n/ \ogp = o{N) and 
X ~ N{fj,, S), we have as n,p ^ 00, 

P 



^110^ oi|2 n ( ^°SP\ 
V \ n / 
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Theorem 4 provides a parallel result to Bickel and Levina (2008) 's Theorem 4, where 
they obtained the same rate for for estimating the covariance matrix under CV. Using 
similar arguments of theirs, this result can be extends to multiple folds. The assumption 
that X ~ N{fi, I]) can be extended to the sub-Gaussian tails or the polynomial-type tails. 
The normality is only used for simplifying the proof. Theorem H] is the first result on the 
convergence rate for estimating the precision matrix based on CV. 

5 Numerical examples 

We will first briefly introduce our algorithms for solving SCIO. We will then illustrate the 
numerical merits in estimation and computation using simulated and real datasets. 

Recall the asymmetric estimator B = (/3j) from (jl]), and the final SCIO estimator is 

obtained simply by applying symmetrization ([5]) to B. We compute each column /3j by 

A = argmin (^/3fS„A - /3f e, + A \\f3,\\,] . (17) 

This objective can be solved easily using iterative coordinate descent. To simplify the 
notation, we will use (3 to denote /3j in ( IT7|) for a fixed i, as we will apply the same 
algorithm for each column i. In each iteration, we fix all but one coordinate in /3, and 
optimize over that coordinate. Without loss of generality, we consider optimizing over the 
pth coordinate /3p while all other coordinates of /3 (denoted by /3_p) are fixed, the solution is 
given in explicit form by the following lemma. The solution for optimizing other coordinates 
while fixing the remaining ones are similar, simply by permuting the matrix to have that 
coordinate being the last one. We will iterate through coordinates until convergence. 

Lemma 1 Let the subvector partition (5 = {l3_p, /3pj and partition S„ similarly as follows 
Fixing f3_ , the minimizer of (fT7|) is given by 



/3p = r(l{p = z}-/3^,Si2,A)/S 



22 



where the soft thresholding rule T(x, A) = sign(x)(|x| — A). 

We implement this algorithm in an R package SCIO, and it is publicly available through 
CRAN. All the following numerical computation is performed using R on an AMD Opteron 
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processor (2.6 GHz) with 32 Gb memory. The Glasso estimator is computed using its R im- 
plementation glasso (version 1.7). We have also implemented the path-following strategies 
(Friedman et al 2008) in SCIO similar to the Glasso implementation, where the solutions 
are obtained in the decreasing order of A's and the initializer for each A is set to the con- 
verged solution under its predecessor A. We set the numerical accuracy to be 10~^ for both 
SCIO and Glasso, so that iterations stop in both algorithms when the changes are less than 
the accuracy. 

5.1 Simulations 

We compare the performance of our estimators with Glasso using simulations. The covari- 
ance matrix that we use to generate data contain two block diagonals, where the second 
block is 4 times the first one. Similar examples have been used in Cai and Liu (2011) in 
studying adaptive covariance estimation. The first block is generated from the following 
models respectively. 

1. decay: [17*].^. = O.GI'-^'L 

2. sparse: Let 0,0 = + SI, where each off-diagonal entry in O is generated indepen- 
dently and equals to 0.5 with probability 0.1 or with probability 0.9. 6 is chosen 
such that the conditional number (the ratio of maximal and minimal singular values 
of a matrix) is equal to p. Finally, this matrix block is standardized to have unit 
diagonals. 

3. block: A block diagonal matrix with block size 5 where each block has off-diagonal 
entries equal to 0.5 and diagonal 1. The resulting matrix is then randomly permuted. 

For each model, 100 observations are generated from multivariate Gaussian distribution as 
a training data set, and 100 additional observations are generated from the same model 
as a validating data set. Using the training data, a series of estimators with 50 different 
values of A are computed. For a fair comparison, we first pick the tunning parameters of 
Glasso and SCIO by minimizing the Bregman loss respectively on the validation sample. 
The Bregman loss is defined by 

L(s, n) = {n, s) - logdet(ri). 

We also compare with our theoretically justified CV scheme with the column-wise loss 
f lT5|) . The theoretical guarantee of this CV method is proved in Theorem |H The resulting 



12 



estimator is denoted by SCIOcv. We consider different values oi p = 50, 100,200,400 and 
replicate 100 times. 

Table [1] compares the estimation performance of SCIO, SCIOcv, and Glasso under the 
spectral norm and the Frobenius norm. It shows that SCIO almost uniformly outperforms 
Glasso under both norms. The SCIO estimator shows slightly worse performance in the 
Block model but the difference is very small. The SCIOcv estimator is almost always the 
second best, probably because the Bregman loss is the correct likelihood criterion here. 

The support of the inverse covariance matrix carries important consequences for the 
graphical models. The frequencies of correct zero/nonzero identification are summarized in 
Table [2J The true negative rates (TN%) shows that the SCIO estimates are sparser than 
Glasso estimates. To illustrate this, we plot the heatmaps of support recovery in Figure 
[1] using p = 100 as an representing example. These heatmaps confirm that our SCIO 
estimates are sparser than Glasso. By visual inspection, these SCIO estimates also tend 
to be closer to the truth. They are robust in these two-block models where the sparsity of 
the estimated two blocks are not interfered by their scale, whereas Glasso has shown some 
interference and artificial stripes appearing in the estimates under the Sparse model. The 
SCIOcv estimators almost always have the sparsity patterns between the SCIO and Glasso 
estimators. 

5.2 HIV-1 associated neurocognitive disorders 

Antiretroviral therapy (ART) has greatly reduced mortality and morbidity of HIV patients; 
however, HIV-1 associated neurocognitive disorders (HAND) are common among patients, 
which cause greatly degradation of life quality. Borjabad et al (2011) analyzed gene expres- 
sion arrays on post-mortem brain tissues. They showed that patients with HAND on ART 
have many fewer and milder gene expression changes than untreated patients, and these 
genes are postulated to regulate certain pathways. The dataset is publicly available from 
Gene Expression Ominibus (GEO) under the serial number GSE28160. We here apply 
our graphical models to study how their genetic interactions/pathways are altered between 
treated and untreated patients, and compare with other methods on classification of future 
samples. 

This dataset contains gene expression profiles of post-mortem brain tissues using two 
biological replications. The first replication dataset contains 6 control (healthy) samples, 
7 treated HAND samples, and 8 untreated HAND samples; the second contains 3 controls, 
5 treated, and 6 untreated. The data are preprocessed by GEO and then log-transformed 
using Bioconductor in R. We will use the first replications as a training set, and test the 
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Table 1: Comparison of average (SD) losses of SCIO, SCIOcv, and Glasso over 100 simulation runs. The best performance 
is highlighted in bold. 



Spectral Norm 



hii- 





Decay 




Sparse 






Block 




p 
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Table 2: Comparison of average support recovery (SD) of SCIO, SCIOcv, and Glasso over 100 simulation runs. 
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Figure 1: Heatmaps of support recovery over 100 simulation runs (black is 100/100, white 
is 0/100). 
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classification performance of 3 classes on the second replications. The class label is denoted 
by k, where k=l,2,3 for control, treated and untreated respectively. The model building 
procedure is similar to our previous paper Cai, Liu and Luo (2011). On the training 
data, we first compare pair-wise mean differences between 3 classes for each gene using 
Wilcoxon's test, and select the top 100 genes with the most significant p- values in any 
of the tests. Based on these 100 genes and the training data, we estimate the inverse 
covariance matrix Clk for each class k using SCIO and Glasso. For a new observation 
X from the testing dataset, the classification score for each pair of class {k, k') is by the 
log-likelihood difference (ignoring constant factors) 

Sk,k' {X) = -{x- Xk) ^k {X - Xk) + {x- Xk') n^' {X - Xk') 

+ logdet (nk^ - logdet ({ik'^ 

where Xi is the mean vector for class / using the training data, I = k,k' and k ^ k'. 
This score is essentially the log-likelihood differences under two estimated multivariavate 
normal distributions. Because each class has almost the same number of observations in 
the training data, we will assign the label k if Sk,k' > and k' otherwise. 

Figure [2a] compares classification accuracy of treated and untreated HAND using SCIO 
and Glasso. The results comparing two HAND groups with the controls respectively are 
not shown because we have constant area-under-the-curve values equal to 1 in both com- 
parisons. Because the number of nonzero off-diagonal elements depends on the choice of 
penalization parameters in each method, we plot the classification accuracy against the av- 
erage percentages of nonzero off-diagonals (or connected edges) of these two classes (treated 
and untreated) under each A. The SCIOcv estimator (not shown) only differs from SCIO 
by the choice of picking A, and it is irrelevant here as we show the performance across all 
A's. This figure shows that Glasso and SCIO have similar performance under most of the 
sparsity percentages, but SCIO outperforms Glasso using the same number of connected 
edges in some cases. The SCIO estimators have also stable classification performance even 
if the number of connected edges increases. We didn't plot the performance of Glasso with 
more than 14% connected edges (smaller penalization parameters), because we found the 
Glasso algorithm didn't converge within 120 hours to achieve the same sparsity percentages 
on the same dataset. As a side comparison with other classification algorithms, we build 
other classifiers using the same selected 100 genes from the training data, including random 
forest (Brieman, 2001), AIC penalized logistic regression, and LI penalized logistic regres- 
sion with 5-fold cross validated penalization parameters. Their classification accuracies are 
78.6%, 90.9% and 45.6% respectively. Our classification rule compares favorably as well 
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Figure 2: Comparison of classification accuracy and running times using SCIO and Glasso 
for the HIV dataset. Red solid line is SCIO and blue dotted line is Glasso. 
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with these competing methods on this dataset. 

To compare the computation time, Figure [2b] plots the running times of Glasso and SCIO 
against the percentages of connected edges for the same reason as before. Because Friedman 
et al (2008b) showed that path-following algorithms compute a sequence of penalization 
parameters to a small value much fast than computing for the single small value, we use 50 
log-spaced penalization parameters in each computation. They range from the largest (0% 
edges) to the values corresponding to the designated percentages of edges, including 5%, 
10%, 14%, 20%, 30%, 40%, 50% and 60%. As reported before, we didn't plot the running 
times for Glasso beyond 14% because it didn't converge. SCIO takes about 2 seconds more 
than Glasso when computing for 5% edges, but is much fast than Glasso for 10% and more 
edges. It compares favorably in the 14% case where SCIO takes only a quarter of the time 
of Glasso. The running time of the SCIO estimator grows linearly with the number of 
connected edges, while Glasso has shown an exponential growth in time. 

To compare the graphical models recovered. Figure 3 plots the supports with a repre- 
senting case of 10% connected edges using both SCIO and Glasso. Each subject class has 
different connection patterns as shown by both SCIO and Glasso, and both methods also 
recover some shared patterns for each class. However, it is noted that Glasso tend to have 
artificial stripes in the pattern, which is also observed in simulations. 



Figure 3: Comparison of support recovered by SCIO and Glasso for the HIV dataset, when 
we 10% of the edges are connected in all plots. 



(a) Control-SCIO 



(b) Treated-SCIO 



(c) Untreated-SCIO 




■ •—fy.'"..-^.-J' 



^: ■■;■*'■' -a; - 










(d) Control-Glasso 



(e) Trcatcd-Glasso 



(f) Untrcatcd-Glasso 




■'.■ii;""'rf---3-7;-.rTx:^'-^^'- ■ ■" ■: ■■'■i-'^r-' ■■:■.- 



■■■■.■ ■■■■■:■ ; :r . ■ T X : — - ■■ ^ i^i--- ■ .-:■ 't-i-^' ^-■- ■■- ■■-■■-■-■--. ! ■. -^^^-Xj- ' 



19 



5.3 Attention deficit hyperactivity disorder 

Attention Deficit Hyperactivity Disorder (ADHD) has substantial impairment among about 
10% of scliool-age cliildren in United States. Dickstein et al (2011) used resting state 
fMRI scans to sliow tliat tliere are differences in correlations between brain region among 
typically developed children and children with such disorders. The ADHD-200 project 



(http://fcon_1000.projects.nitrc.org/indi/adhd200/) released fMRI resting data of healthy 



control and ADHD children to encourage research on these diseases. We apply our method 



using the preprocessed data from neurobureau (http://www.nitrc.org/plugins/mwiki/index.php/neurobure 
from one of the participating center, Kennedy Krieger Institute. There are 61 typically- 
developing controls (HC), and 22 ADHD cases. The preprocessing steps are described in 
the same website. After preprocessing, we have 148 time points from each of 116 brain re- 
gions of each subject. We here want to study the precision matrix pattern for each subject, 
as it reveals conditional independence and is more relevant to explore direct connectivity. 

We estimate the inverse covariance matrices using SCIO and Glasso with varying 
penalty parameters for each subject. As reported before, the connection patterns de- 
pend on the choice of penalty, and we thus compare patterns with the same percentage of 
connections for each subject. Figure H] illustrates the average heatmaps across subjects of 
ADHD and HC respectively recovered by SCIO and Glasso. We let all individual precision 
matrices to have 30% connected edges as a representing case. Both methods have shown 
that ADHD has increased number of nonzero entries off the diagonal comparing with HC. 
Both methods recover similar patterns of nonzero entries close the diagonal, but SCIO 
tends to be less noisy on the entries far away from the diagonals. 

The running times for both methods are compared in Figure O As reported before, for 
each subject, we use path following algorithms in both methods to designated connected 
edges, including 10%, 20%, 30%, 40%, 50% and 60%. We then plot the average running 
times and standard errors. This plot shows that the running times of SCIO grows almost 
linearly. Comparing with Glasso, SCIO is about 2 times faster with 60% connected edges. 

6 Discussion 

We introduce the SCIO estimator in this paper. Theoretical guarantees of this estimator are 
established under various norms. We present a path-following algorithm for computing this 
estimator fast. The advantages of our estimators are also illustrated using both simulated 
and real examples. 

The choice of the tuning parameter is an important problem in applying penalization 
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Figure 4: Heatmaps of support recovered by SCIO and Glasso for the ADHD dataset, 
when we set 30% of the edges are connected in each subject. 
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Figure 5: Comparison of average running times for the ADHD dataset. 
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procedures, despite numerous theoretical results. This paper is among the first to demon- 
strated that cross validation provides theoretical guarantees that the resulting estimator 
achieves the n~^/^(logp) rate under the Frobenius norm. This rate may not be improved 
as we suspect it should be the minimax optimal rate. Moreover, it is very interesting to 
study whether such rate can also be achieved in other matrix norms, such as the operator 
norm, using data-driven choice of A. These results will further bridge the gap between 
theory and practice of these penalization methods. 

The rate we provide in Theorem 3 coincides with the minimax optimal rate in Cai, Liu 
and Zhou (2011). However, note that U together with ([7]) is actually a smaller class of 
matrices than theirs. It is interesting to explore if their minimax rate can be improved 
in this important sub-class, though the current rate is already the desirable rate in high 
dimensional inference in general. 

Penalized regression and inverse covariance estimation are closely connected problems 
in statistics. During the preparation of this paper. It comes to our attention that Sun 
and Zhang (2012) recently applied their recently developed penalized regression procedure. 
Scale Lasso, to the inverse covariance matrix estimation. Their procedure is aiming to adapt 
to the variances of the errors in regression. It is interesting to study if their procedure can 
also be applied under our column loss. 

We considered enforcing sparsity via the li norm due to computational concerns. It 
has been pointed by several authors that the ii penalty inheritably introduces biases in 
estimation, and thus it is interesting to replace the £i norm by other penalty forms, such as 
Adaptive Lasso (Zou, 2006) or SCAD (Fan et al, 2009, Zhou et al, 2009). Such extensions 
should be easy to implement because our procedure only employs column-wise operations. 
We are currently implementing these methods for future releases of our R package. 

There are other interesting directions to expand the current models. It is interesting 
to study precision matrix estimation when the data are generated from some hidden factor 
models, where the covariance estimation problem was studied by Luo (2011). Recently, Guo 
et al (2011) introduced a new penalty to jointly estimate multiple graphical models, assum- 
ing that these graphs have some shared patterns. It is interesting to extend our approach 
to that setting. It is also interesting to consider extending SCIO to the nonparanomral 
case for high dimensional undirected graphs (Liu et al, 2009). 

This paper only considers the setting that all the data are observed. It is an interesting 
problem to study the inverse covariance matrix estimation when some of the data are 
possibly missing. It turns out that the SCIO procedure can also be applied to the missing 
data setting, with some modifications. Due to the space limitation, we will report these 
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results elsewhere. 



7 Proof of main results 



To prove the main results, we need the following lemma which comes from (28) in Cai, Liu 
and Luo (2011). 

Lemma 1 Under (C2) or (C2*), we have for some C > 0, 



P( max {|a., - a.,|/(4/Vjf )} > cJ^) = 0{p~^ + n^^'% 

Let ft = (uij) = (cji, . . . ,uJp), Si be the support of uji and uj^. = {ujji]j E Si)'^. The 
following lemma comes from Cai, Liu and Zhou (2011). 

Lemma 2 Under (C2) or (02") and Cq^ < Ainin(f^) < Amax(f^) < cq, we have for some 
C>0, 

P( max \±s.>cs.u^s. - e^Joo > Cxf^) = 0{p~' + n~'/'). 

Proof of Theorem 1. For the solution /3j, it satisfies that 

5^/3j — ej = —XnZi, 
where Zi =: {Zu, . . . , Zpi)'^ is the subdifferential 9|/3j|i satisfying 

[ 1, ^J^ > 0; 

% = I -1, ^,. < 0; 

[ G[-l,l], ^,, = 0. 

Define f3^ be the solution of the following optimization problem: 

^°= argmin | Vs/3 - ef/3 + A„|/3|i|, 

where supp(/3) denotes the support of (3. We will show that /3,j = (3^ with probability 
greater than 1 — 0{p~^ + n~^^^), and hence Theorem 1 can be obtained from Theorem 2. 

Let Zg_ is the subdifferential 9|/3j |i on Si. We define the vector Zi = {Zu, . . . , Zpi)'^ 
by letting Zji = Z°^ for j G iSj and 

% = -A,-i(S^:), for J G5f. 
24 



By Lemma [3], for j G S^ and some r < 1 , 

\Zji\ < r < 1 (18) 

with probability greater than 1 — 0{p~^ +n~^^^). By this primal-dual witness construction 
and fl22p . the theorem is proved. | 

Lemma 3 With probability greater than 1 — 0{p~^), we have 

\Zji\ < l-a/2 
uniformly for j G iSf . 
Proof. By the definition of Zi, we have 

ts,^sM.-es^ = -X^Zs^ (19) 

and 

^sfxs.f^s^ = -KZs^. (20) 

Write ^ as 

This implies that 

PI - iVs, = lls^^s. ( - XnZs, - (S5.X5. - ^S^xsMPI - ^<sJ - S5,x5.C^5, + 65.) • (21) 

By ([6]), Lemma [Hand Lemma [21 we have with probability greater than 1 —0{p^^ +n~ 



-5/8) 



\/3s, -^S,\2 <C^^Sp\0gp/n + 0{l)\|3s^-UJs^2■ 

This implies that 



\(3s^-u,s^2<C^Sp\ogp/n. (22) 

By f[20l) and the above equation, we have 

= T- C^SfxS, - '^SfxS, ) {f3s, - ^S, ) - '^SfxS, '^S^xS^ ^Si 
-^'^SfxS^^SiXsS'^SiXS, - ^SiXS,)iPs, ~ ^S,) 
-^'^SfxS^'^SiXSii'^SiXS.^Si - esj 
+ ^i^SfxS^ - ^SfxS,)^Sr 
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Since USsc^s^S^^^x^Jloo < 1 -a and \Zs,\oo < 1, we have \i:s-xS,'^sl>iS,Zs,\oo <l-a. By 
fl22]) and Lemma [H we obtain that with probabihty greater than 1 — 0{p^^ + n"*^/®), 

\{^stxs. - ^s<r>iS,){K^ - W5,)|oo < Csp\og'p/n. (23) 

This, together with Lemma [21 imphes (TTSl) . | 

Proof of Theorems [2] and [3l By the proof of Theorem [1], we have (3^ = f3^ . Note that 

A - ^. = S"'( - A„Z, - (S - S)(^, - iv,) - Ecu, + e,). (24) 

By fl2^ and Lemma [H we obtain that with probabihty greater than 1 — 0{p~^ + n~^^^), 

\{± - S)(^, - a;,)U < Csplogp/n. (25) 

Thus, 



|A-U;,|oo<CMpl'^°^^ 



77, 



This proves ([10]). By 022]) and the inequahty ||r2-ri||2, < 2^^^^ |^,-a;,|2, we obtain ([TT]). 
Theorem [3] (i) foUows from the proof of Theorem [H Theorem [3] (ii) follows from Theorem 
[2] and the lower bound condition on 6p. | 
Proof of Theorem [H Let 



^; = argmin | VsJ/S - ej(3 + \\(3\A 



f3eiRp 



with A = Cy^logp/n E {Aj, 1 < i < N} and C is sufficiently large. Then by the proofs of 
Theorem [T] and [21 we have with probability greater than 1 — 0{p~^), 



'2 / ^. logP 



max 1/3. — ojjL < Cs„ . 

i<i<p -^ n 

By the definition of (3^ , we have 

-(A)^S^/3, - e^A < -{(3,f^l(3, - ej(3,. 

Set Di = f3^ — (jJi and Z?° = /3j — cJj. This implies that 

((S2 - S) A, A) + (S A, A) + 2(S2a;, - e„ fi] - ^") 
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We have by Lemma HJ 



|((S2 - S)A, A)| = Op(l)|A|^ ' ^°^^ 



n 



and 



(S^o;. - e.,^; - ^;) = Op{l)\^] - ^"b. /^°^^ 



n 



Thus, 



I Al^ < Op (i/^) (I Ab + I a: - U..I2) + I Al^- 

This proves the theorem. 

Lemma 4 For any vector Vi with \vi\2 = 1, we have 



max^|((i]2 - ^H,v,)\ = Op Ji^ 1 (26) 



and 



Proof of Lemma 3]. Note that 

((S2 - i:)v„ V,) = ((S-1/25.I5.-1/2 _ j)sV2^^^ s^/^^,). 

To prove (12B]) . without loss of generahty, we assume that S = /. Then S2 has the same 
distribution as ^ Xlfcl^ ^kV^, where Vk =■ {Vki, . . . , Vkp)'^, 1 < k < n2 — I, are inde- 
pendent A^(0,/) random vectors. Set Eg " ^ = ~(Xlfc=L ^kij)pxp and i> = (fi, . . . ,Vp)'^. 
We have 

-, "2- 1 

((S2 - S)v, '«^) = — XI XI ^i^i^fciJ 

fc=i i<j,j<p 



1 "2-1 2 



fc=i i<j<p 



(!26|) is proved by the tail probability of x^ distribution. (127|) follows from the exponential 
inequality in Lemma 1, Cai and Liu (2011). | 
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Proof of Lemma [T] The objective is equivalent to (after neglecting constant terms with 
respect to f3p) 

The minimizer of above should have a subgradient equal to zero, 

/3^pEi2 + /3pE22 - 1 {p = z} + A sign(/3p) = 0. 
Thus the solution is given by the thresholding rule 

/3p = r(l{p = z}-/3^pSi2,A)/S22. 
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